Corpus tools for lexicographers
Authors
Abstract
To analyse corpus data, lexicographers need software that allows them to search, manipulate and save data: a 'corpus tool'. A good corpus tool is the key to a comprehensive lexicographic analysis; a corpus without a good tool to access it is of little use. Both corpus compilation and corpus tools have been swept along by general technological advances over the last three decades. Compiling and storing corpora has become far faster and easier, so corpora tend to be much larger than before. Most of the first COBUILD dictionary was produced from a corpus of eight million words. Several of the leading English dictionaries of the 1990s were produced using the 100-million-word British National Corpus (BNC). Current lexicographic projects we are involved in use corpora of around a billion words, though this is still less than one hundredth of one percent of the English-language text available on the Web (see Rundell, this volume). The amount of data to analyse has thus increased significantly, and corpus tools have had to improve to help lexicographers adapt to this change. They have become faster, more multifunctional, and more customizable. In the COBUILD project, getting concordance output took a long time, and the concordances were then printed on paper and handed out to lexicographers (Clear 1987). Today, with Google as a point of comparison, concordancing needs to be instantaneous, with the analysis taking place on the computer screen. Moreover, larger corpora yield far more concordance lines per word (especially for high-frequency words), and, given the time constraints lexicographers work under (see Rundell, this volume), new data-summarization features are needed to ease and speed up the analysis. In this chapter, we review the functionality of corpus tools used by lexicographers. In Section 3.2, we discuss the corpus-preparation procedures that some of these features require. In Section 3.3, we briefly describe some leading tools.
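As a concrete illustration of the most basic function such a tool provides, the sketch below is a minimal key-word-in-context (KWIC) concordancer in Python. The function name kwic, the window width, and the toy corpus are illustrative assumptions, not part of any tool discussed in the chapter; real corpus tools index the corpus in advance so that queries over billions of words return instantly.

import re

def kwic(text, keyword, width=40):
    """Yield concordance lines: left context | keyword | right context."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width].ljust(width)
        yield f"{left} | {m.group(0)} | {right}"

corpus = ("A good corpus tool is the key to a comprehensive lexicographic analysis; "
          "a corpus without a good tool to access it is of little use.")
for line in kwic(corpus, "corpus"):
    print(line)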
Similar resources
Providing Lexicographers with Corpus Evidence for Fine-grained Syntactic Descriptions: Adjectives Taking Subject and Complement Clauses
This article deals with techniques for lexical acquisition that allow lexicographers to extract evidence for fine-grained syntactic descriptions of words from corpora. The extraction tools are applied to partially parsed text corpora and aim to provide the lexicographer with easy-to-use, syntactically pre-classified evidence. As an example, we extracted German adjectives taking subject and comp...
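The kind of syntactically pre-classified evidence described above can be approximated with an off-the-shelf parser. The sketch below is a rough, hypothetical illustration using spaCy's English pipeline rather than the partially parsed German corpora of the article; the model name en_core_web_sm and the set of clausal dependency labels are assumptions tied to spaCy's English label scheme.

import spacy
from collections import Counter

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

CLAUSAL_DEPS = {"ccomp", "csubj", "xcomp"}  # clausal dependents in spaCy's English scheme

def adjectives_with_clauses(texts):
    """Count adjectives that govern a clausal dependent, directly or via their copular head."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        for tok in doc:
            if tok.pos_ != "ADJ":
                continue
            # Predicative adjectives ("is unclear whether ...") may have the
            # clause attached to the copula, so check the head as well.
            candidates = [tok, tok.head] if tok.dep_ == "acomp" else [tok]
            if any(child.dep_ in CLAUSAL_DEPS
                   for cand in candidates for child in cand.children):
                counts[tok.lemma_.lower()] += 1
    return counts

sample = [
    "It is unclear whether the entry needs revision.",
    "The editors are keen to compare corpus evidence.",
]
print(adjectives_with_clauses(sample).most_common())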
Tools for Upgrading Printed Dictionaries by Means of Corpus-based Lexical Acquisition
We present the architecture and tools developed in the project TFB-32 for updating existing dictionaries by comparing their content with corpus data. We focus on an interactive graphical user interface for the manual selection of the results of this comparison. The tools were developed and used in cooperation with lexicographers from two German publishing houses.
Detection of Domain Specific Terminology Using Corpora Comparison
Identifying terms in specialized corpora is a central task in terminological work (the compilation of domain-specific dictionaries), but it is labour-intensive, especially when the corpora are voluminous, which is often the case nowadays. For the past decade, terminologists and specialized lexicographers have been able to rely on term-extraction tools to assist them in the selection of terms. However, ...
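One common way to operationalize such corpora comparison is a keyness measure: words that are markedly over-represented in the domain corpus relative to a reference corpus are proposed as term candidates. The sketch below ranks candidates with Dunning's log-likelihood statistic; the tokenized toy corpora and the function name keyness are illustrative assumptions, and this is not necessarily the measure used in the article above.

import math
from collections import Counter

def keyness(domain_tokens, reference_tokens, top=20):
    """Rank words over-represented in the domain corpus by log-likelihood."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = sum(dom.values()), sum(ref.values())
    scores = {}
    for word, a in dom.items():
        b = ref.get(word, 0)
        # Expected frequencies under the null hypothesis of equal relative frequency.
        e1 = n_dom * (a + b) / (n_dom + n_ref)
        e2 = n_ref * (a + b) / (n_dom + n_ref)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        # Keep only positive keywords (relatively more frequent in the domain corpus).
        if n_ref == 0 or a / n_dom > b / n_ref:
            scores[word] = ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

domain = "corpus concordance lemma collocation corpus corpus tool lemma".split()
reference = "the of and a to in that is language the corpus of and tool".split()
print(keyness(domain, reference))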
WASP-Bench: a Lexicographic Tool Supporting Word Sense Disambiguation
We present WASP-Bench: a novel approach to Word Sense Disambiguation, also providing a semi-automatic environment for a lexicographer to compose dictionary entries based on corpus evidence. For WSD, involving lexicographers tackles the twin obstacles to high accuracy: paucity of training data and insufficiently explicit dictionaries. For lexicographers, the computational environment fills the n...
The Berkeley FrameNet Project
FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a) a commitment to corpus evidence for semantic and syntactic generalizations, and (b) the representation of the valences of its target words (mostly nouns, adjectives, and verbs) in which the semantic ...